The Loss Rank Principle for Model Selection
We introduce a new principle for model selection in regression and
classification. Many regression models are controlled by some smoothness or
flexibility or complexity parameter c, e.g. the number of neighbors to be
averaged over in k nearest neighbor (kNN) regression or the polynomial degree
in regression with polynomials. Let f_D^c be the (best) regressor of complexity
c on data D. A more flexible regressor can fit more data D' well than a more
rigid one. If something (here small loss) is easy to achieve it's typically
worth less. We define the loss rank of f_D^c as the number of other
(fictitious) data D' that are fitted better by f_D'^c than D is fitted by
f_D^c. We suggest selecting the model complexity c that has minimal loss rank
(LoRP). Unlike most penalized maximum likelihood variants (AIC,BIC,MDL), LoRP
only depends on the regression function and loss function. It works without a
stochastic noise model, and is directly applicable to any non-parametric
regressor, like kNN. In this paper we formalize, discuss, and motivate LoRP,
study it for specific regression problems, in particular linear ones, and
compare it to other model selection schemes.
Comment: 16 pages
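The loss rank definition in this abstract can be illustrated by brute force on a toy problem. The sketch below is not the paper's implementation. It assumes a small discrete target set (`levels`), squared-error loss, kNN regression evaluated on its own training points with each point counted as its own neighbour, and ties counted as "fitted at least as well" so that k = 1 does not trivially win. The data and all function names are hypothetical.

```python
import itertools
import numpy as np

def knn_fit_loss(x, y, k):
    """Squared-error loss of kNN regression evaluated on its own training data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise distances; each point is its own nearest neighbour (distance 0).
    dist = np.abs(x[:, None] - x[None, :])
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    predictions = y[nearest].mean(axis=1)
    return float(np.sum((y - predictions) ** 2))

def loss_rank(x, y, k, levels):
    """Count fictitious targets y' that kNN fits at least as well as the real y."""
    actual = knn_fit_loss(x, y, k)
    return sum(
        knn_fit_loss(x, candidate, k) <= actual + 1e-12
        for candidate in itertools.product(levels, repeat=len(x))
    )

# Hypothetical toy data: a step from 0 to 2, targets restricted to {0, 1, 2}.
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 0, 2, 2, 2]
levels = (0, 1, 2)

ranks = {k: loss_rank(x, y, k, levels) for k in (1, 2, 3, 6)}
best_k = min(ranks, key=ranks.get)  # complexity with minimal loss rank
print(ranks, best_k)
```

With k = 1 every fictitious data set is fitted perfectly, so the rank equals the total number of data sets and the most flexible regressor is penalized without any explicit noise model.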
Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
As the field of data science continues to grow, there will be an
ever-increasing demand for tools that make machine learning accessible to
non-experts. In this paper, we introduce the concept of tree-based pipeline
optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement an open source Tree-based Pipeline
Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a
series of simulated and real-world benchmark data sets. In particular, we show
that TPOT can design machine learning pipelines that provide a significant
improvement over a basic machine learning analysis while requiring little to no
input or prior knowledge from the user. We also address the tendency for TPOT
to design overly complex pipelines by integrating Pareto optimization, which
produces compact pipelines without sacrificing classification accuracy. As
such, this work represents an important step toward fully automating machine
learning pipeline design.
Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016, edits not yet
made from reviewer comments
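The Pareto optimization step mentioned in this abstract trades accuracy against pipeline size. As a rough illustration only, the sketch below keeps the candidates that no other candidate beats on both criteria. It does not use TPOT's API; the candidate pipelines, their scores, and the function name `pareto_front` are hypothetical.

```python
def pareto_front(candidates):
    """Return candidates not dominated on (higher accuracy, fewer operators)."""
    front = []
    for name, acc, size in candidates:
        # A candidate is dominated if another is no worse on both criteria
        # and strictly better on at least one.
        dominated = any(
            other_acc >= acc and other_size <= size
            and (other_acc > acc or other_size < size)
            for _, other_acc, other_size in candidates
        )
        if not dominated:
            front.append((name, acc, size))
    return sorted(front, key=lambda c: c[2])

# Hypothetical pipelines: (label, cross-validated accuracy, number of operators).
candidates = [
    ("A", 0.90, 1),
    ("B", 0.93, 3),
    ("C", 0.93, 6),
    ("D", 0.91, 4),
    ("E", 0.95, 5),
]

front = pareto_front(candidates)
print([name for name, _, _ in front])
```

Here pipeline C is dropped because B reaches the same accuracy with fewer operators, which is the sense in which compact pipelines can be preferred without giving up accuracy.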
Tracing the Evolution of Physics on the Backbone of Citation Networks
Many innovations are inspired by past ideas in a non-trivial way. Tracing
these origins and identifying scientific branches is crucial for research
inspirations. In this paper, we use citation relations to identify the
descendant chart, i.e. the family tree of research papers. Unlike other
spanning trees which focus on cost or distance minimization, we make use of the
nature of citations and identify the most important parent for each
publication, leading to a tree-like backbone of the citation network. Measures
are introduced to validate the backbone as the descendant chart. We show that
citation backbones can well characterize the hierarchical and fractal structure
of scientific development, and lead to accurate classification of fields and
sub-fields.
Comment: 6 pages, 5 figures
Machine Learning for Quantum Mechanical Properties of Atoms in Molecules
We introduce machine learning models of quantum mechanical observables of
atoms in molecules. Instant out-of-sample predictions for proton and carbon
nuclear chemical shifts, atomic core level excitations, and forces on atoms
reach accuracies on par with density functional theory reference. Locality is
exploited within non-linear regression via local atom-centered coordinate
systems. The approach is validated on a diverse set of 9k small organic
molecules. Linear scaling of computational cost in system size is demonstrated
for saturated polymers with up to sub-mesoscale lengths.
Detection of trend changes in time series using Bayesian inference
Change points in time series are perceived as isolated singularities where
two regular trends of a given signal do not match. The detection of such
transitions is of fundamental interest for the understanding of the system's
internal dynamics. In practice observational noise makes it difficult to detect
such change points in time series. In this work we elaborate a Bayesian method
to estimate the location of the singularities and to produce some confidence
intervals. We validate the ability and sensitivity of our inference method by
estimating change points of synthetic data sets. As an application we use our
algorithm to analyze the annual flow volume of the Nile River at Aswan from
1871 to 1970, where we confirm a well-established significant transition point
within the time series.
Comment: 9 pages, 12 figures, submitted
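To make the idea of a posterior over change-point locations concrete, here is a deliberately simplified sketch. It is not the method of the paper, which handles changes in trend. It assumes a single change in a piecewise-constant mean, Gaussian noise with known standard deviation `sigma`, flat priors on both segment means, and a uniform prior on the location. The synthetic data and the function name are hypothetical.

```python
import numpy as np

def changepoint_posterior(y, sigma):
    """Posterior over the index tau that splits y into y[:tau] and y[tau:]."""
    y = np.asarray(y, dtype=float)
    taus = np.arange(1, len(y))
    log_post = np.empty(len(taus))
    for i, tau in enumerate(taus):
        left, right = y[:tau], y[tau:]
        # Residual sum of squares around each segment mean.
        ss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        # Integrating out each mean under a flat prior leaves a 1/sqrt(m) factor.
        log_post[i] = -0.5 * np.log(len(left) * len(right)) - ss / (2 * sigma**2)
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return taus, post / post.sum()

# Synthetic series: a jump from 0 to 3 at index 10 plus a small deterministic wiggle.
t = np.arange(20)
y = np.where(t < 10, 0.0, 3.0) + 0.1 * np.sin(t)

taus, post = changepoint_posterior(y, sigma=0.5)
map_tau = int(taus[np.argmax(post)])
print(map_tau)
```

The normalized vector `post` can be summed over neighbouring indices to obtain an interval that contains the change point with a chosen posterior probability.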
A General Optimization Technique for High Quality Community Detection in Complex Networks
Recent years have witnessed the development of a large body of algorithms for
community detection in complex networks. Most of them are based upon the
optimization of objective functions, among which modularity is the most common,
though a number of alternatives have been suggested in the scientific
literature. We present here an effective general search strategy for the
optimization of various objective functions for community detection purposes.
When applied to modularity, on both real-world and synthetic networks, our
search strategy substantially outperforms the best existing algorithms in terms
of final scores of the objective function; for description length, its
performance is on par with the original Infomap algorithm. The execution time
of our algorithm is on par with non-greedy alternatives present in the literature,
and networks of up to 10,000 nodes can be analyzed in time spans ranging from
minutes to a few hours on average workstations, making our approach readily
applicable to tasks which require the quality of partitioning to be as high as
possible, and are not limited by strict time constraints. Finally, based on the
most effective of the available optimization techniques, we compare the
performance of modularity and code length as objective functions, in terms of
the quality of the partitions one can achieve by optimizing them. To this end,
we evaluated the ability of each objective function to reconstruct the
underlying structure of a large set of synthetic and real-world networks.
Comment: Main text: 14 pages, 4 figures, 1 table. Supplementary information: 19
pages, 8 figures, 5 tables
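For readers unfamiliar with modularity, the objective function named in this abstract, the sketch below evaluates the standard modularity score Q of a given partition of an undirected, unweighted graph. It only computes the objective and does not reproduce the search strategy of the paper. The example graph and the function name are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q = sum over communities of L_c/m - (d_c / 2m)^2."""
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

q_split = modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
q_single = modularity(edges, {node: 0 for node in range(6)})
print(q_split, q_single)
```

Splitting the graph into its two triangles gives Q = 5/14, while placing all nodes in one community gives Q = 0, so an optimizer that maximizes modularity would prefer the split.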
Expected exponential loss for gaze-based video and volume ground truth annotation
Many recent machine learning approaches used in medical imaging are highly
reliant on large amounts of image and ground truth data. In the context of
object segmentation, pixel-wise annotations are extremely expensive to collect,
especially in video and 3D volumes. To reduce this annotation burden, we
propose a novel framework to allow annotators to simply observe the object to
segment and record where they have looked with a $200 eye gaze tracker. Our
method then estimates pixel-wise probabilities for the presence of the object
throughout the sequence, from which we train a classifier in a semi-supervised
setting using a novel Expected Exponential loss function. We show that our
framework provides superior performance on a wide range of medical image
settings compared to existing strategies and that our method can be combined
with current crowd-sourcing paradigms as well.
Comment: 9 pages, 5 figures, MICCAI 2017 - LABELS Workshop
Estimating the Expected Value of Partial Perfect Information in Health Economic Evaluations using Integrated Nested Laplace Approximation
The Expected Value of Partial Perfect Information (EVPPI) is a
decision-theoretic measure of the "cost" of parametric uncertainty in decision
making used principally in health economic decision making. Despite this
decision-theoretic grounding, the uptake of EVPPI calculations in practice has
been slow. This is in part due to the prohibitive computational time required
to estimate the EVPPI via Monte Carlo simulations. However, recent developments
have demonstrated that the EVPPI can be estimated by non-parametric regression
methods, which have significantly decreased the computation time required to
approximate the EVPPI. Under certain circumstances, high-dimensional Gaussian
Process regression is suggested, but this can still be prohibitively expensive.
Applying fast computation methods developed in spatial statistics using
Integrated Nested Laplace Approximations (INLA) and projecting from a
high-dimensional into a low-dimensional input space allows us to decrease the
computation time for fitting these high-dimensional Gaussian Processes, often
substantially. We demonstrate that the EVPPI calculated using our method for
Gaussian Process regression is in line with the standard Gaussian Process
regression method and that despite the apparent methodological complexity of
this new method, R functions are available in the package BCEA to implement it
simply and efficiently.